(57) Abstract: Class network routing is implemented in a network such as a
computer network comprising a plurality of parallel compute processors at nodes
(Q00-Q22) thereof. Class network routing allows a compute processor to broadcast
a message to a range (one or more) of other compute processors in the computer
network, such as processors in a column or a row. Normally this type of
operation requires a separate message to be sent to each processor. With class
network routing pursuant to the invention, a single message is sufficient, which
generally reduces the total number of messages in the network as well as the
latency to do a broadcast. Class network routing is also applied to dense matrix
inversion algorithms on distributed memory parallel supercomputers (Fig. 1) with
hardware class function (multicast) capability. This is achieved by exploiting
the fact that the communication patterns of dense matrix inversion can be served
by hardware class functions, which results in faster execution times.

CLASS NETWORK ROUTING

CROSS-REFERENCE 

The present invention claims the benefit of commonly-owned, co-pending United
States Provisional Patent Application Serial Number 60/271,124 filed February 24,
2001 entitled MASSIVELY PARALLEL SUPERCOMPUTER, the whole contents and
disclosure of which is expressly incorporated by reference herein as if fully set
forth herein. This patent application is additionally related to the following
commonly-owned, co-pending United States Patent Applications filed on even date
herewith, the entire contents and disclosure of each of which is expressly
incorporated by reference herein as if fully set forth herein. U.S. patent
application Serial No. (YOR920020027US1, YOR920020044US1 (15270)), for "Class
Networking Routing"; U.S. patent application Serial No. (YOR920020028US1
(15271)), for "A Global Tree Network for Computing Structures"; U.S. patent
application Serial No. (YOR920020029US1 (15272)), for "Global Interrupt and
Barrier Networks"; U.S. patent application Serial No. (YOR920020030US1
(15273)), for "Optimized Scalable Network Switch"; U.S. patent application Serial
No. (YOR920020031US1, YOR920020032US1 (15258)), for "Arithmetic Functions
in Torus and Tree Networks"; U.S. patent application Serial No.
(YOR920020033US1, YOR920020034US1 (15259)), for "Data Capture Technique
for High Speed Signaling"; U.S. patent application Serial No. (YOR920020035US1
(15260)), for "Managing Coherence Via Put/Get Windows"; U.S. patent application
Serial No. (YOR920020036US1, YOR920020037US1 (15261)), for "Low Latency
Memory Access And Synchronization"; U.S. patent application Serial No.
(YOR920020038US1 (15276)), for "Twin-Tailed Fail-Over for Fileservers
Maintaining Full Performance in the Presence of Failure"; U.S. patent application
Serial No. (YOR920020039US1 (15277)), for "Fault Isolation Through No-Overhead
Link Level Checksums"; U.S. patent application Serial No.
(YOR920020040US1 (15278)), for "Ethernet Addressing Via Physical Location for
Massively Parallel Systems"; U.S. patent application Serial No.
(YOR920020041US1 (15274)), for "Fault Tolerance in a Supercomputer Through
Dynamic Repartitioning"; U.S. patent application Serial No. (YOR920020042US1
(15279)), for "Checkpointing Filesystem"; U.S. patent application Serial No.
(YOR920020043US1 (15262)), for "Efficient Implementation of Multidimensional
Fast Fourier Transform on a Distributed-Memory Parallel Multi-Node Computer";
U.S. patent application Serial No. (YOR9-20010211US2 (15275)), for "A Novel
Massively Parallel Supercomputer"; and U.S. patent application Serial No.
(YOR920020045US1 (15263)), for "Smart Fan Modules and System".

BACKGROUND OF THE INVENTION 
1. Field of the Invention

The present invention relates generally to class network routing, and more
particularly pertains to class network routing which implements class routing in a
network such as a computer network comprising a plurality of parallel compute
processors at nodes thereof, and which allows a compute processor to broadcast a
message to one or more other compute processors in the computer network, such as
processors in a column or a row. Normally this type of operation requires a separate
message to be sent to each processor. With class network routing pursuant to the
invention, a single message is sufficient, which generally reduces the total number
of messages in the network as well as the latency to do a multicast.

The present invention relates to the field of message-passing data networks, for
example, a network as used in a distributed-memory message-passing parallel
computer, as applied for example to computation in the field of life sciences.

The present invention also uses the class function on a torus computer network to do
dense matrix calculations. By using the hardware-implemented class function on the
torus computer network it is possible to do high performance dense matrix
calculations.

The present invention also relates to the field of distributed-memory, message-passing
parallel computer design and system software, as applied for example to
computation in the field of life sciences. More specifically it relates to the field of
high performance linear algebra software for distributed memory parallel
supercomputers.

WO 02/069550    PCT/US02/05573

2. Discussion of the Prior Art

A large class of important computations can be performed by massively parallel
computer systems. Such systems consist of many compute nodes, each of which
typically consists of one or more CPUs, memory, and one or more network interfaces
to connect it with other nodes.

The computer described in related U.S. provisional application Serial No.
60/271,124, filed February 24, 2001, for A Massively Parallel Supercomputer,
leverages system-on-a-chip (SOC) technology to create a scalable cost-efficient
computing system with high throughput. SOC technology has made it feasible to
build an entire multiprocessor node on a single chip using libraries of embedded
components, including CPU cores with integrated, first-level caches. Such
packaging greatly reduces the component count of a node, allowing for the creation
of a reliable, large-scale machine.

A message-passing data network serves to pass messages between nodes of a
network, each of which can perform local operations independently of other nodes.
Nodes can act in concert by passing messages between them over the network. An
example of such a network is a distributed-memory parallel computer wherein each
of its nodes has one or more processors that operate on local memory. An
application using multiple nodes of such a computer coordinates the actions of the
multiple nodes by passing messages between them. The words switch and router are
used interchangeably throughout this specification.

A message-passing data network consists of switches and links, wherein a link
merely passes data between two switches. A switch routes incoming data from a
node or link to another node or link. A switch may be connected to an arbitrary
number of nodes and links. Depending on their location in the network, a message
between two nodes may need to traverse several switches and links.



Prior art networks efficiently support some types of message passing, but not all
types. For example, some networks efficiently support unicast message passing to a
single receiving node, but not multicast message passing to an arbitrary number of
receiving nodes. Efficient support of multicast message passing is required in
various situations, such as numerical algorithms executed on a distributed-memory
parallel computer. It is a requirement in the applications disclosed herein for
dense matrix inversion using class functions.

Many user applications need to invert very large N by N (NxN) dense matrices,
where N is greater than several thousand. Dense matrices are matrices in which
most entries are non-zero. Typically, inversion of such matrices can only
be done using large distributed memory parallel supercomputers. Algorithms that
perform dense matrix inversions are well known and can be generalized for use in
distributed memory parallel supercomputers. In that case a large amount of
inter-processor communication is required. This can slow down the application
considerably.

SUMMARY OF THE INVENTION 

Accordingly, it is a primary object of the present invention to provide class network
routing which implements class routing in a network which allows a compute
processor to broadcast a message to a range of processors, such as processors in a
column or a row. Normally this type of operation requires a separate message to be
sent to each processor. With class routing pursuant to the present invention, a single
message is sufficient, which generally reduces the total number of messages in the
network as well as the latency to do a broadcast. The class network routing
enhances a network such that it more efficiently supports some additional types of
message passing.

Class routing enhances a network to more efficiently support additional types of
message passing. As usual, a message is divided into one or more packets which
pass atomically through the network. Class routing adds a class value to each packet.
At each switch, the class value is used as an index to one or more tables, whose
stored values determine the actions performed by the switch on the packet. An
index-based table-lookup is fast and efficient, as required for maximal throughput
and minimal latency across a switch.

Class routing can be summarized as an efficient encoding and decoding of
information needed by a switch to act on a packet, to enable the network to provide
certain types of message passing. The information is encoded in the class value of
the packet and in the tables of the switches. The information is decoded by using the
class value of a packet as an index to the tables.

A network without class routing is referred to as a basic network. With class routing,
it is an enhanced network. With the appropriate entries in the class tables of all the
switches, one or more classes of the enhanced network can provide the
message-passing types of the basic network. Moreover, since using the class value of a
packet as an index to a table is fast, the message-passing types of the basic network
are not appreciably slowed down by the enhancement when compared with the basic
network.

Other entries in the class tables can provide message-passing types beyond those of
the basic network. For example, the unicast message passing of a basic network can
be enhanced by class routing to path-based multidrop message passing for
multiphase multicasting.

In the classes described above, the enhanced network provides the message-passing
types of the basic network, either unmodified or enhanced. In addition, some classes
of the enhanced network could override the basic network. For example, overriding
classes can provide multidestination message passing for single-phase multicasting.
If class routing provides the only message-passing types, then no underlying basic
network is required.

The present invention makes dense matrix inversion algorithms on distributed
memory parallel supercomputers with hardware class function capability perform
faster. A hardware class function is a particular use of class routing. This is
achieved by exploiting the fact that the communication patterns of dense matrix
inversion can be served by hardware class functions. This results in faster execution
times.

If the parallel supercomputer possesses class function capability at the hardware
level, then the particular communication patterns of dense matrix inversion can be
exploited by using class functions in order to minimize the communication delay.
For example, provisional application Serial No. 60/271,124 describes a computer
with class function capability at the hardware level.

BRIEF DESCRIPTION OF THE DRAWINGS 

The foregoing objects and advantages of the present invention for a class network
routing may be more readily understood by one skilled in the art with reference
being had to the following detailed description of several embodiments thereof
taken in conjunction with the accompanying drawings wherein like elements are
designated by identical reference numerals throughout the several views, and in
which:

Figure 1 illustrates an exemplary distributed-memory parallel supercomputer that
includes 9 nodes interconnected via a multidimensional grid utilizing a
2-dimensional 3x3 Torus network according to the present invention;

Figure 2 illustrates in more detail an exemplary node Q00 of the nine nodes of the
distributed-memory parallel supercomputer of Figure 1;

Figure 3 illustrates an exemplary single phase multicast from node Q00 to the other
8 nodes of the distributed-memory parallel supercomputer illustrated in Figure 1;

Figure 4 illustrates a 4 x 4 grid of processors wherein each processor is labeled by its
row, column numerals.

DETAILED DESCRIPTION OF THE INVENTION 

The distributed-memory parallel supercomputer described in U.S. provisional
application Serial No. 60/271,124 comprises a plurality of nodes. Each of the nodes
includes at least one processor, which operates on a local memory. The nodes are
interconnected as a multidimensional grid and they communicate via grid links.
Without losing generality and in order to make the description of this invention
easily understandable to one skilled in the art, the multidimensional node grid will
be described as an exemplary 2-dimensional grid or an exemplary 3-dimensional
grid. The 3-dimensional grid is implemented by a Torus-based architecture.
Notwithstanding the fact that only the 2-dimensional node grids or 3-dimensional
node grids are described in the following description, it is contemplated within the
scope of the present invention that grids of other dimensions may easily be provided
based on the teachings of the present invention. An example of 3 dimensions is the
3-dimensional grid implemented on the Torus-based architecture described in
provisional application Serial No. 60/271,124.

Figure 1 is an exemplary illustration of a distributed-memory parallel supercomputer
that includes 9 nodes interconnected via a multidimensional grid utilizing a
2-dimensional 3x3 Torus network 100. It is noted that the number of nodes is in
exemplary fashion limited to 9 nodes for brevity and clarity, and that the number of
nodes may significantly vary depending on the particular architectural requirements
for the distributed-memory parallel supercomputer. Figure 1 depicts 9 nodes labeled
as Q00 - Q22, pairs of which are interconnected by grid links. In total, the 9-node
Torus network 100 is interconnected by 18 grid links, where each node is directly
interconnected to four other nodes in the Torus network 100 via a respective grid
link. It is noted that unlike a mesh, the exemplary 2-dimensional Torus network
100 includes no edge nodes. For example, node Q00 is interconnected to node Q20
via grid link 102; to node Q02 via grid link 104; to node Q10 via grid link 106; and
finally to node Q01 via grid link 108. As another example, node Q11 is
interconnected to node Q01 via grid link 110; to node Q10 via grid link 112; to
node Q21 via grid link 114; and finally to node Q12 via grid link 116. Other nodes
are interconnected in a similar fashion.
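
The connectivity just described can be checked with a short sketch. It assumes
that node Qab sits at coordinates (a, b) and that every dimension wraps around;
the function names torus_neighbors and torus_links are illustrative and do not
come from the specification.

```python
from itertools import product

def torus_neighbors(node, dims):
    """Return the neighbors of `node` on a torus with the given side lengths."""
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, +1):
            coords = list(node)
            coords[axis] = (coords[axis] + step) % size  # wrap around the edge
            neighbors.append(tuple(coords))
    return neighbors

def torus_links(dims):
    """Return the set of undirected links, each stored as a frozenset of two nodes."""
    links = set()
    for node in product(*(range(size) for size in dims)):
        for neighbor in torus_neighbors(node, dims):
            links.add(frozenset((node, neighbor)))
    return links

# Q00 should neighbor Q01, Q02, Q10 and Q20, and the 3x3 torus should have 18 links.
print(sorted(torus_neighbors((0, 0), (3, 3))))  # [(0, 1), (0, 2), (1, 0), (2, 0)]
print(len(torus_links((3, 3))))                 # 18
```

The link count follows from 9 nodes with 4 links each, where every link is
shared by two nodes.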

Data communicated between nodes is transported on the network in one or more
packets. For any given communication, more than one packet is needed if the
amount of data exceeds the packet size supported by the network. A packet consists
of a packet header followed by the data carried by the packet. The packet header
contains information required by the torus network to transport the packet from the
source node of the packet to the destination node. In a distributed-memory parallel
supercomputer that is implemented by the assignee of the present patent application,
each node on the network is identified by a logical address and the packet header
includes a destination address so that the packet is automatically routed to a node on
the network as identified by a destination.

Figure 2 is an exemplary illustration of node Q00 of the distributed-memory parallel
supercomputer of Figure 1. The node is similar to that in provisional application
Serial No. 60/271,124. The node contains one processor which operates on local
memory. The node contains a router which sends and receives packets on the grid
links 102, 104, 106, 108 connecting the node Q00 to its neighboring nodes
Q20, Q02, Q10, Q01, respectively, as illustrated in Figure 1. The node contains a
reception buffer. If the router receives a packet destined for the local processor, the
packet is placed into the reception buffer, from which the packet can be received by
the processor. Depending on the application and the packet, the processor may write
the contents of the packet into memory. The node contains an injection buffer
which operates in a first-in, first-out (FIFO) manner. If the CPU places a packet into
an injection FIFO, once the packet reaches the head of the FIFO, the packet is
removed from the FIFO by the router and the router places the packet onto a grid
link toward the destination node of the packet.

The routing implemented by the router has several simultaneous characteristics. The
characteristics are some of those described in provisional application Serial No.
60/271,124. The routing is a virtual cut-through routing. Thus if an incoming packet
on one of the grid links is not destined for the processor, then the packet is
forwarded by the router onto one of the outgoing links. This forwarding is
performed by the router without the involvement of the processor. The routing is a
shortest-path routing. For example, a packet sent by node Q00 to node Q02 will
travel over the grid link 104. Any other path would be longer. For another example,
a packet sent by node Q00 to node Q11 will travel over the grid links 106 and 112 or
over the grid links 108 and 110. The routing is an adaptive routing. There may be a
choice of grid links by which a packet can leave a node. In the previous example, the
packet could leave the node Q00 via the grid link 106 or 108. For a packet leaving a
node, adaptive routing allows the router to choose the less busy outgoing link for a
packet or to choose the outgoing link based on some other criteria. Adaptive routing
is not just performed at the source node of a packet; adaptive routing also is
performed at each intermediate node that a packet may cut through on the packet's
way to the packet's destination node.
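
A minimal sketch of how shortest-path and adaptive routing combine at a single
node follows. It assumes node Qab has coordinates (a, b) on a wrap-around grid
and represents an outgoing link as an (axis, step) pair; the names
productive_links and choose_link, and the use of queue depth as the "less busy"
measure, are assumptions made for illustration only.

```python
def productive_links(src, dst, dims):
    """Return the (axis, step) outgoing links that lie on some shortest path."""
    choices = []
    for axis, size in enumerate(dims):
        forward = (dst[axis] - src[axis]) % size   # hops needed in the + direction
        backward = (src[axis] - dst[axis]) % size  # hops needed in the - direction
        if forward == 0:
            continue  # already aligned with the destination in this dimension
        if forward <= backward:
            choices.append((axis, +1))
        if backward <= forward:
            choices.append((axis, -1))
    return choices

def choose_link(choices, queue_depth):
    """Adaptive step: pick the candidate link with the smallest queue depth."""
    return min(choices, key=lambda link: queue_depth.get(link, 0))

# Q00 -> Q02: a single wrap-around hop is the only shortest path.
print(productive_links((0, 0), (0, 2), (3, 3)))  # [(1, -1)]
# Q00 -> Q11: two outgoing links are equally short, so the router may choose.
options = productive_links((0, 0), (1, 1), (3, 3))
print(options)                                   # [(0, 1), (1, 1)]
print(choose_link(options, {(0, 1): 5, (1, 1): 2}))  # (1, 1)
```

Because the same decision is made again at every intermediate node, the path
actually taken can differ from packet to packet.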

Class routing can be used to achieve a wide variety of types of message passing.
Some of these types are described in the following examples which describe many
details of class routing.

Example 1. Path-based multidrop message passing: 

The network of a distributed-memory parallel computer is an example of a
message-passing data network. Each node of such a computer has one or more processors
that operate on their local memory. An application using multiple nodes of such a
computer coordinates their actions by passing messages between them. An example
of such a computer is described in provisional application Serial No. 60/271,124 for
A Massively Parallel Supercomputer. In that computer, each single node is paired
with a single switch of the network. In that computer, the switches are connected to
each other as a three dimensional (3D) torus. Thus in that computer, each switch is
linked to six other switches. These links are to a switch in the positive direction and
to a switch in the negative direction in each of the three dimensions. Each switch is
identified by its (x, y, z) logical address on the 3-dimensional torus. By contrast, in a
computer using a 2-dimensional torus, each switch is identified by its (x, y) logical
address. In Figure 1, the positive X direction is towards the right, and the positive Y
direction is towards the bottom. In Figure 1, node Q00 has the logical address (0,0),
node Q01 has logical address (0,1) and so on. Since each node is paired with a single
switch, a node has the address of its switch. By including a field for such a logical
address in the packet header, the packet can efficiently and conveniently identify its
destination node. Without class routing, the basic network only provides unicast
message passing. If a switch is the destination of an incoming packet, then the
packet is given to the local node. Otherwise, the packet is put onto a link towards
the destination node.

The following is an example using class routing to implement multidrop message
passing. Each packet header has a field for a class value. This value is either 0 or 1.
Each switch has a table used to determine if, in addition to the usual unicast routing
of the packet, a copy should be deposited at the local node. This assumes for the
original unicast message passing, that the processor is not involved when the router
forwards a packet from one of the incoming links to one of the outgoing links. This
assumption is satisfied by virtual cut-through routing, as implemented for example
in provisional application Serial No. 60/271,124. For the class values [0,1], the
entries in this deposit table are [0,1] and demand that the packet is not deposited or
deposited, respectively. The table is illustrated below. The table only applies for a
packet at a node other than its destination node. A packet at its destination node is
deposited as in the usual unicast routing. Thus packets with class value 0 obey the
original unicast message passing. Packets with class value 1 perform path-based
multidrop message passing.



For a packet NOT destined for this node

class value    deposit value
     0               0
     1               1
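
The deposit table above can be exercised with a small sketch. It assumes that a
packet from (0,0) to (0,2) travels along its row and visits (0,1) on the way;
the function name nodes_receiving and the list representation of the table are
illustrative, not taken from the specification.

```python
# Deposit table from the text: index = class value, entry = deposit value.
DEPOSIT_TABLE = [0, 1]

def nodes_receiving(path, class_value, deposit_table=DEPOSIT_TABLE):
    """Return the nodes on `path` that receive a copy of the packet.

    `path` lists the nodes the packet visits after leaving the source,
    ending with its destination node.
    """
    *intermediate, destination = path
    receivers = []
    for node in intermediate:
        # The deposit table applies only at nodes other than the destination.
        if deposit_table[class_value] == 1:
            receivers.append(node)
    # At its destination the packet is deposited as in the usual unicast routing.
    receivers.append(destination)
    return receivers

# Assumed path for a packet from (0,0) to (0,2) travelling along its row.
path = [(0, 1), (0, 2)]
print(nodes_receiving(path, class_value=0))  # [(0, 2)]
print(nodes_receiving(path, class_value=1))  # [(0, 1), (0, 2)]
```

Class value 0 reproduces plain unicast delivery, while class value 1 drops a
copy at every node the packet passes.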



Path-based multidrop message passing can be used to implement multiphase
multicasting, as described for example in D.K. Panda, S. Singal and P. Prabhakaran,
"Multidestination Message Passing Mechanism Conforming to Base Wormhole
Routing Scheme", PCRCW'94, LNCS 853, Springer-Verlag, pp. 131-145, 1994.
The first example described here is a two phase multicast from node (0,0) to the 9
nodes of the 3*3 torus illustrated in Figure 1. In the first phase, node (0,0) sends a
multidrop message with destination (0,2). In the second phase, each of the 3
nodes holding the message after the first phase simultaneously sends a multidrop
message: node (0,0) sends to (2,0); node (0,1) to (2,1); and node (0,2) to (2,2). At
the end of the second phase, all 9 nodes of the 2-dimensional torus have received
the broadcast message.

The above assumes that in the original unicast message passing, when the source
node and destination node are in the same row, then the path of the packet is along
that row. A row is a group of nodes which have equal values for all but one of the
dimensions of the torus or mesh. The assumption is guaranteed by shortest-path
routing, as implemented for example in provisional application Serial No.
60/271,124. The above assumption also is guaranteed by the deterministic routing
implemented in the provisional application. By contrast, the above assumption is not
satisfied by the congestion avoidance routing implemented elsewhere, which routes
a packet via some random node.

The second example described here is a three phase multicast from node (0,0,0) to
the 125 nodes of the 5*5*5 cube with the corners (0,0,0) and (4,4,4). In the first
phase, node (0,0,0) sends a multidrop message with destination (0,0,4). In the
second phase, each of the 5 nodes holding the message after the first phase
simultaneously sends a multidrop message. Node (0,0,0) sends to (0,4,0); node
(0,0,1) to (0,4,1) and so on. In the third phase, each of the 25 nodes holding the
message after the second phase simultaneously sends a multidrop message. Node
(0,0,0) sends to (4,0,0); node (0,0,1) to (4,0,1) and so on. At the end of the third
phase, all 125 nodes of the cube have received the broadcast message.

The above example of a 3-phase multicast for the 3-dimensional cube is easily
generalized to a D-phase multicast from an origin node to all nodes of a
D-dimensional cube. In a first phase, the origin node sends a multidrop
message to all other nodes in one of the rows of the sending node. In a second phase,
each of the recipients of the first phase and the sender of the first phase
simultaneously send a multidrop message to all other nodes in a row orthogonal to
the row of the first phase. In a third phase, each of the recipients of the second phase
and the senders of the second phase simultaneously send a multidrop message to all
other nodes in a row orthogonal to the rows of the first and second phases, and so on
in further phases, such that all nodes of the cube receive the broadcast message after
all the phases.
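
The generalization can be sketched as a small simulation that tracks which nodes
hold the message after each phase. It assumes a cube with the same number of
nodes along every dimension and processes the last coordinate first, matching
the 5*5*5 example; the function name multiphase_multicast is hypothetical.

```python
def multiphase_multicast(origin, size, dims):
    """Simulate a `dims`-phase multicast on a cube with `size` nodes per side.

    Returns the number of nodes holding the message after each phase.
    """
    holders = {origin}
    counts = []
    for axis in reversed(range(dims)):  # last coordinate first, as in the text
        new_holders = set(holders)
        for sender in holders:
            # A multidrop message reaches every node in the sender's row.
            for value in range(size):
                node = list(sender)
                node[axis] = value
                new_holders.add(tuple(node))
        holders = new_holders
        counts.append(len(holders))
    return counts

print(multiphase_multicast((0, 0, 0), size=5, dims=3))  # [5, 25, 125]
print(multiphase_multicast((0, 0), size=3, dims=2))     # [3, 9]
```

Each phase multiplies the number of holders by the side length, so D phases
cover all nodes of a D-dimensional cube.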


The implementation of path-based multidrop message passing using class routing
offers advantages beyond existing implementations. For example, a particular
existing implementation places the deposit value into the packet. In that
implementation, every node on the path of the packet receives a copy of the packet.
In contrast, since each switch can have different entries in its deposit table, class
routing allows a node with the deposit entries [0,0] to not receive a copy of a packet,
even though the node is on the path of the multidrop packet. The table is illustrated
below. For example, with several class values for multicasting, this allows for
several multicast groups, each with a different set of nodes.



For a packet NOT destined for this node

class value    deposit value
     0               0
     1               0



Example 2. Sending multidrop packets without knowing the recipients 

As described in Example 1, class routing allows a node with the deposit entries [0,0]
for class values [0,1] to not receive a copy of a packet, even though the node is on
the path of the multidrop packet. This information need not be known by the source
node of the multidrop packet. In other words, class routing allows a node to source a
multidrop packet without knowing the recipients. However, in the network of
Example 1, there is one exception: the destination node of the multidrop packet
always will receive a copy of the packet. Thus if the destination node is to not
receive a copy of the packet, this must be known by the source node such that it can
use another destination.

For example, assume node (0,0) is the source of a multidrop packet originally
destined for node (0,2). This may be a natural destination on a torus network of size
3*3, since nodes (0,0) through (0,2) are a complete row. If node (0,2) is to not
receive a copy, then this must be known by node (0,0). If node (0,0) also knows that
node (0,1) is to receive a copy, then (0,1) can be used as the destination of the
multidrop packet.

In order to solve the exception caused by the destination node, class routing allows
each switch to have an additional table which determines if a copy of a packet
should be deposited at the destination node. To solve the above example, for node
(0,2) the entries in this destination table are [1,0] for the class values [0,1]. The entry
0 for class 1 causes node (0,2) to not receive multidrop messages, even if it is the
destination. The entry 1 for class 0 allows node (0,2) to receive unicast messages as
usual. The two tables are illustrated below.



For a packet destined for this node (0,2)

class value    deposit value
     0               1
     1               0

For a packet NOT destined for this node (0,2)

class value    deposit value
     0               0
     1               0



In the above example, node (0,2) is not a participant in the multicast with class value
1.

As a contrasting example, node (0,1) is a participant in the multicast with class value
1. The corresponding tables for node (0,1) are illustrated below.



For a packet destined for this node (0,1)

class value    deposit value
     0               1
     1               1

For a packet NOT destined for this node (0,1)

class value    deposit value
     0               0
     1               1
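
The interaction of the two tables can be sketched as follows, using the entries
given above for nodes (0,1) and (0,2). The SwitchTables class, the receivers
function, and the assumption that a packet from (0,0) to (0,2) passes through
(0,1) are illustrative choices rather than part of the specification.

```python
from dataclasses import dataclass

@dataclass
class SwitchTables:
    destination: list  # indexed by class value; used when this node is the destination
    deposit: list      # indexed by class value; used at any other node on the path

    def deposits(self, class_value, is_destination):
        """Return True if the local node receives a copy of the packet."""
        table = self.destination if is_destination else self.deposit
        return table[class_value] == 1

# Table entries for the two nodes, copied from the tables above.
TABLES = {
    (0, 1): SwitchTables(destination=[1, 1], deposit=[0, 1]),
    (0, 2): SwitchTables(destination=[1, 0], deposit=[0, 0]),
}

def receivers(path, class_value):
    """Return the nodes on `path` (ending at the destination) that get a copy."""
    destination = path[-1]
    return [node for node in path
            if TABLES[node].deposits(class_value, node == destination)]

path = [(0, 1), (0, 2)]  # assumed route of a packet sent by (0,0) to (0,2)
print(receivers(path, class_value=0))  # [(0, 2)]
print(receivers(path, class_value=1))  # [(0, 1)]
```

With class value 1 the source can keep (0,2) as the destination without knowing
that (0,2) has opted out of the multicast.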


Example 3. Snooping: 



Assume the network described above in Example 1, including its use of the class
value 0 for the unicast messages of the basic network. A node can snoop, and
acquire and store information on the unicast packets passing through its switch, by
using the entry 1 for class value 0 in the deposit table.

The table is illustrated below. In the example, the node is a participant in the
multicast with class value 1. The table only applies for a packet at a node other than
its destination node. In this example, a packet at its destination node is deposited as
in the usual unicast routing.

For a packet NOT destined for this node

class value    deposit value
     0               1
     1               1



An example use of such snooping is the investigation of the performance of the
network. Without snooping there may only be information on when the packet
entered the network at the source node and when it exited at the destination node.
With snooping, there can be information on when the packet passed through a node
on the path of the packet. Since there may be multiple valid paths between a pair of
nodes, snooping also can provide information on which particular path was used.
An example of a routing with multiple valid paths between a pair of nodes is
adaptive routing, as implemented for example in provisional application Serial No.
60/271,124.

Since each switch can have different entries in its deposit table, class routing allows
an arbitrary number of nodes to be snooping. If only a small fraction of nodes in the
network are snooping, then the measurements are a statistical sampling.

Snooping is an example use of class routing not specifically related to multicasting.

Example 4. Single Phase Multicast

In a single phase multicast, the message is injected once into the network by one of
the nodes. In contrast, in a multiphase multicast, the message is injected several
times into the network, perhaps by multiple nodes. For example, in the multiphase
multicast on the 3*3 node torus described above in Example 1, the message is
injected a total of 1+3=4 times by 3 different nodes. Similarly, in the multiphase
multicast on the 5*5*5 node torus described above in Example 1, the message is
injected a total of 1+5+25=31 times by 25 different nodes.
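
By way of illustration only, the injection counts quoted above can be reproduced
with a short calculation. The sketch assumes, following Example 1, that every node
holding the message at the start of a phase injects it once in that phase. The
function name multiphase_injections and its parameters are hypothetical and are
not part of the disclosure.

```python
def multiphase_injections(side, dims):
    """Count injections in a dims-phase multicast on a cube of the given side.

    Assumption (from Example 1): phase p is started by every node that holds
    the message after phase p-1, the original source included.
    Returns (total_injections, distinct_injecting_nodes).
    """
    senders_per_phase = [side ** p for p in range(dims)]
    total = sum(senders_per_phase)
    # Senders of earlier phases send again in later phases, so the number of
    # distinct injecting nodes equals the sender count of the last phase.
    distinct = senders_per_phase[-1]
    return total, distinct


print(multiphase_injections(3, 2))  # (4, 3) for the 3*3 torus
print(multiphase_injections(5, 3))  # (31, 25) for the 5*5*5 torus
```

A single phase multicast, by comparison, always has exactly one injection.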

As is well known, to provide single phase multicast, a switch must be able to
duplicate an incoming packet onto multiple outgoing links. In essence, the message
duplication performed by a node in multiphase multicasting is performed by a
switch in single phase multicasting.

The advantage offered by class routing for single phase multicasting is an efficient
encoding and decoding of which of the outgoing switches do or do not receive a
copy of a particular incoming packet. After a simple example describing the
encoding and decoding scheme offered by class routing, the scheme is compared to
existing schemes.

The first example described here is the same multicast described in Example 1 from
node (0,0) to the 9 nodes of the 3*3 torus illustrated in Figure 1. In Example 1 it is a
two phase multicast; here it is a single phase multicast. Here the pattern of messages
across the network is chosen to be similar to that of Example 1.

Each packet header has a field for a class value. This value is either 0 or 1. Each
switch has a table used to determine if the usual unicast routing of the packet is to be
performed or if the actions of single phase multicast routing are to be performed.
Each entry in the table is a bit string of the format UDXY. If in a table entry U is 1,
then the usual unicast routing is to be performed, otherwise not. If D is 1, then a
copy of the packet is to be deposited at the local node, otherwise not. If X is 1, then
a copy of the packet is to go out the positive X link, otherwise not. If Y is 1, then a
copy of the packet is to go out the positive Y link, otherwise not. The two links in
the negative X and Y directions are irrelevant to the example and are ignored here
for simplicity.

For class value 0, the entry in the table is 1000 on all nodes. Thus packets with class
value 0 obey the original unicast message passing. For class value 1, the entry in the
table depends on the location of the switch in the network. The entry at each switch
mimics the actions of the corresponding node in the multiphase multicast of
Example 1.

At each node, the table is obeyed for all packets entering the node. If a packet has
class value 0, then the UDXY=1000 identifies the packet as a unicast packet and
only then is the destination of the packet examined.

For class value 1, switch (0,0) has the entry 0011. This assumes that the source node
of the multicast does not need another copy. The table for node (0,0) is illustrated
below.

For a packet at node (0,0)
class value    UDXY value
0              1000
1              0011

Continuing with class value 1 for the other switches in the 3*3 torus, the switch
(0,1) has the entry 0111. The four switches (0,2), (1,0), (1,1), and (1,2) have the
entry 0101. The three switches (2,0), (2,1) and (2,2) have the entry 0100. The above
is a complete encoding of the information required for the example multicast using
class 1. In short, packets with class value 0 obey the original unicast message
passing. Packets originating from node (0,0) with class value 1 perform single phase
multicast routing.

The above UDXY values at each node for multicast from node (0,0) using class 1 are
illustrated in Figure 3. At each node, the circle is open if D=0, that is, if no copy of
the packet is to be deposited at the node. At each node, the circle is closed if D=1,
that is, if a copy of the packet is to be deposited at the node. At each node, there is
an arrow in the positive X direction if X=1, that is, if a copy of the packet is to go
out the positive X link. At each node, there is an arrow in the positive Y direction if
Y=1, that is, if a copy of the packet is to go out the positive Y link.
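
As a non-authoritative sketch, the class 1 entries listed for the 3*3 torus can be
replayed in software to check that one injection at node (0,0) deposits exactly one
copy at each of the other 8 nodes. The orientation of the links relative to the node
labels is an assumption made here: the positive X link is taken to increment the
second index of a node label and the positive Y link the first, which is the reading
under which the listed entries cover the torus. Figure 3 remains the authority. The
names CLASS1_TABLE and single_phase_multicast are hypothetical.

```python
from collections import deque

SIDE = 3  # nodes per dimension of the 3*3 torus

# Class-1 table entries (UDXY bit strings) exactly as listed in the text.
CLASS1_TABLE = {
    (0, 0): "0011",
    (0, 1): "0111",
    (0, 2): "0101", (1, 0): "0101", (1, 1): "0101", (1, 2): "0101",
    (2, 0): "0100", (2, 1): "0100", (2, 2): "0100",
}


def single_phase_multicast(source):
    """Inject one class-1 packet at `source`; return deposit counts per node."""
    deposits = {}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        _u, d, x, y = (bit == "1" for bit in CLASS1_TABLE[node])
        if d:
            deposits[node] = deposits.get(node, 0) + 1
        first, second = node
        if x:  # assumption: the positive X link increments the second index
            queue.append((first, (second + 1) % SIDE))
        if y:  # assumption: the positive Y link increments the first index
            queue.append(((first + 1) % SIDE, second))
    return deposits


deposits = single_phase_multicast((0, 0))
print(sorted(deposits))                                # the 8 non-source nodes
print(all(count == 1 for count in deposits.values()))  # True
```

The switch does all duplication; the source node injects the packet only once.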

The second example described here is the same multicast described in Example 1
from node (0,0,0) to the 125 nodes of the 5*5*5 cube with the corners (0,0,0) and
(4,4,4). In Example 1 it is a three phase multicast; here it is a single phase multicast.
Here the pattern of messages across the network is chosen to be similar to that of
Example 1.

Each packet header has a field for a class value. This value is either 0 or 1. Each
switch has a table used to determine if the usual unicast routing of the packet is to be
performed or if the actions of single phase multicast routing are to be performed.
Each entry in the table is a bit string of the format UDXYZ. If in a table entry U is 1,
then the usual unicast routing is to be performed, otherwise not. If D is 1, then a
copy of the packet is to be deposited at the local node, otherwise not. If X is 1, then
a copy of the packet is to go out the positive X link, otherwise not. Similarly for the
bits Y and Z. The three links in the negative X, Y and Z directions are irrelevant to
the example and are ignored here for simplicity.

For class value 0, the entry in the table is 10000 on all nodes. Thus packets with
class value 0 obey the original unicast message passing. For class value 1, the entry
in the table depends on the location of the switch in the network. The entry at each
switch mimics the actions of the corresponding node in the multiphase multicast of
Example 1.

For class value 1, switch (0,0,0) has the entry 00111. This assumes that the source
node of the multicast does not need another copy. The three switches (0,0,1) through
(0,0,3) have the entry 01111. Switch (0,0,4) has the entry 01110. The fifteen
switches in the x=0 plane with the corners (0,1,0), (0,1,4), (0,3,0) and (0,3,4) have
the entry 01110. The five switches (0,4,0) through (0,4,4) have the entry 01100. The
75 switches of the cube with the corners (1,0,0), (1,0,4), (3,0,0) and (3,0,4) have the
entry 01100. The 25 switches in the x=4 plane with the corners (4,0,0), (4,0,4),
(4,4,0) and (4,4,4) have the entry 01000. The above is a complete encoding of the
information required for the example multicast using class 1. In short, packets with
class value 0 obey the original unicast message passing. Packets originating from
node (0,0,0) with class value 1 perform single phase multicast routing.
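
The ranges of switches listed above can be checked with a small sketch, offered
under stated assumptions rather than as the implementation. It rebuilds the class 1
table from the ranges in the text, then follows one packet injected at (0,0,0). The
link model (each positive link increments one coordinate of an (x,y,z) label) and the
names class1_entry and single_phase_multicast are assumptions for illustration.

```python
from collections import Counter, deque

SIDE = 5  # nodes per dimension of the 5*5*5 cube


def class1_entry(node):
    """UDXYZ entry for class value 1 at switch (x, y, z), per the text."""
    x, y, z = node
    if node == (0, 0, 0):
        return "00111"                               # source: no deposit
    if x == 0 and y == 0:
        return "01111" if z < SIDE - 1 else "01110"  # (0,0,1) .. (0,0,4)
    if x == 0:
        return "01110" if y < SIDE - 1 else "01100"  # rest of the x=0 plane
    return "01100" if x < SIDE - 1 else "01000"      # planes x=1..3, then x=4


def single_phase_multicast(source):
    """Inject one class-1 packet at `source`; count deposits per node."""
    deposits = Counter()
    queue = deque([source])
    while queue:
        node = queue.popleft()
        x, y, z = node
        _u, d, go_x, go_y, go_z = (bit == "1" for bit in class1_entry(node))
        if d:
            deposits[node] += 1
        # Assumed link model: each positive link increments one coordinate.
        if go_x:
            queue.append(((x + 1) % SIDE, y, z))
        if go_y:
            queue.append((x, (y + 1) % SIDE, z))
        if go_z:
            queue.append((x, y, (z + 1) % SIDE))
    return deposits


deposits = single_phase_multicast((0, 0, 0))
print(len(deposits), set(deposits.values()))  # 124 {1}
```

Every node other than the source receives exactly one copy, matching the three
phase multicast of Example 1 but with a single injection.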

In the above example of class routing for single phase multicasting, the UDXYZ bit
string determines onto which output ports a packet is to be duplicated. A similar bit
string is used in some existing implementations of single phase multicasting. An
example is described in R. Sivaram, R. Kesavan, D.K. Panda, C.B. Stunkel,
"Architectural Support for Efficient Multicasting in Irregular Networks", IEEE
Trans. on Par. and Dist. Systems, Vol. 12, No. 5, May 2001. Another example is
described in patent US5333279: Self-timed mesh routing chip with data
broadcasting, D. Dunning. In these existing implementations, a bit string similar to
the above UDXYZ for each switch is in the packet header. In contrast, in the above
class routing implementation, the packet header merely contains the class value
which is used at each switch to look up in a table the UDXYZ entry.

The above class routing implementation of single-phase multicasting is in some
ways less general than these existing implementations, but the class routing is in
some ways more efficient. For example, in the packet header, a field for a class
value is much smaller than a field for a bit string for each switch. In the above
example, the class value is 0 or 1 and thus can be stored in a one-bit field in the
header. In contrast, the above UDXYZ bit string would require a five-bit field in the
header. Moreover, several fields for UDXYZ values would be required, since
different switches have different values for UDXYZ. The smaller field in the header
is more efficient since it consumes less of the physical bandwidth of the torus
network, leaving more bandwidth for the application data. The smaller field also
allows for a smaller latency, since typically at a switch, the entire header must be
received and checked for errors before the packet can be forwarded.
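
The header-size comparison can be made concrete with a rough calculation. This is
a sketch only: the text states that several UDXYZ fields would be required but not
how many, so the count of fields is left as a parameter, and the value 5 used below
(one field per distinct class 1 entry of the 5*5*5 example) is purely an assumption.
The function names are hypothetical.

```python
def class_field_bits(num_classes):
    """Bits needed to hold a class value in the range 0 .. num_classes-1."""
    return max(1, (num_classes - 1).bit_length())


def bitstring_field_bits(dims, num_fields):
    """Bits needed to carry `num_fields` bit strings of U, D and one bit per
    positive link of a dims-dimensional network (UDXYZ when dims is 3)."""
    return (2 + dims) * num_fields


print(class_field_bits(2))          # 1: class value 0 or 1
print(bitstring_field_bits(3, 1))   # 5: a single UDXYZ field
print(bitstring_field_bits(3, 5))   # 25: assumed five fields, for illustration
```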

Example 5. Single Phase Multicast from Any Node in the Network

The single phase multicast using class routing described in Example 4 allows a
single node to be the source of the message. In the example on the 2-dimensional
3*3 torus, the source is the node (0,0). In the example on the 3-dimensional 5*5*5
torus, the source is the node (0,0,0). We'll name this a heterogeneous single phase
multicast, since the class routing table has different values at different nodes. The
table only is used for one of the input links.

Class routing also can be used to implement a single phase multicast where the
source can be any node in the network. We'll name this a homogeneous single phase
multicast, since on a homogeneous network such as a torus the class routing tables
have the same value on every node. On a single node, the class routing tables have
different values on the different incoming links.

The first example described here is the same multicast described in Example 4 from
node (0,0) to the 9 nodes of the 3*3 torus illustrated in Figure 1. In Example 4 it is a
heterogeneous single phase multicast; here it is a homogeneous single phase
multicast. Here the pattern of messages across the network is chosen to be similar to
that of Example 4.

In the heterogeneous single phase multicast of Example 4, a packet arriving at a node
via any of the incoming links uses the same table to determine the actions to be
performed by the switch on the packet based on the class value. As demonstrated in
Example 4, for the heterogeneous multicast, different nodes have different values in
the table. By contrast, in the homogeneous single phase multicast of this example,
each incoming link on each switch has a table used to determine the actions to be
performed on an incoming packet. As demonstrated below, for the homogeneous
multicast, different nodes have the same values in the tables.

Each packet header has a field for a class value. This value is either 0 or 1. Each
incoming link on each switch has a table used to determine if the usual unicast
routing of the packet is to be performed or if the actions of single phase multicast
routing are to be performed. Each entry in the table is a bit string of the format
UDXY. If in a table entry U is 1, then the usual unicast routing is to be performed,
otherwise not. If D is 1, then a copy of the packet is to be deposited at the local
node, otherwise not. If X is 1 and the X-destination of the packet is not the
X-location of the node, then a copy of the packet is to go out the positive X link,
otherwise not. If Y is 1 and the Y-destination of the packet is not the Y-location of
the node, then a copy of the packet is to go out the positive Y link, otherwise not.
For each node, the two outgoing links in the negative X and Y directions are
irrelevant to the example and are ignored here for simplicity. For each node, the two
incoming links in the negative X and Y directions are irrelevant to the example and
are ignored here for simplicity.

As described above, the X-destination and the Y-destination of the packet are
examined in order to determine the actions performed on the packet. Thus for node
(0,0) to broadcast to all other 8 nodes of the 3*3 torus, the packet must have the
destination (2,2).

In general for a broadcast in this example, the destination of the packet is the
furthest node in the positive X and positive Y direction from the source of the
broadcast. For example, for node (1,0) to broadcast to all other 8 nodes of the 3*3
torus, the packet must have the destination (0,2).
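
The rule for choosing the destination can be written as a one-line computation. The
sketch below assumes node coordinates run from 0 to side-1 in each dimension and
that positive links wrap around the torus; the function name broadcast_destination
is hypothetical.

```python
def broadcast_destination(source, side):
    """Furthest node from `source` in the positive direction of every
    dimension of a torus with `side` nodes per dimension."""
    return tuple((coord + side - 1) % side for coord in source)


print(broadcast_destination((1, 0), 3))  # (0, 2), the example given in the text
```

Because the packet stops going out a positive link once its coordinate matches the
destination, this choice lets the multicast visit every node without passing the
source a second time.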

For class value 0, the entry in the table is 1000 on all tables on all nodes. Thus
packets with class value 0 obey the original unicast message passing. For class value
1, the entry in the table depends on which incoming link the packet arrived on. The
tables are illustrated below. The entries for each incoming link are such that the
resulting homogeneous multicast mimics the heterogeneous multicast of Example 4.

For a packet incoming on the link from the negative x direction
class value    UDXY value
0              1000
1              0111

For a packet incoming on the link from the negative y direction
class value    UDXY value
0              1000
1              0011

The above is a complete encoding of the information required for the example
multicast using class 1. In short, packets with class value 0 obey the original unicast
message passing. Packets with class value 1 perform a homogeneous single phase
multicast routing.

Given the above 2-dimensional torus example, the technique is easily extended to
other networks. Class 1 in the above example can be considered to provide
multicasting in the positive X and positive Y quadrant of a mesh. Three additional
similar classes 2, 3 and 4 could provide multicasting in the other three quadrants:
negative X and positive Y; positive X and negative Y; as well as negative X and
negative Y. These four classes allow any node in the mesh to use four multicasts to
effectively broadcast a packet to all other nodes in the mesh. Using the same
broadcast technique on the torus would be twice as fast as the single class technique
described above. It is twice as fast since the distance between the source node and
the destination nodes is halved. This technique is feasible since any node on a torus
can be treated as a node in the middle of a mesh.

The above technique is easily generalized to a mesh or torus of D dimensions. On a
D dimensional mesh or torus, 2^D classes allow any node in the mesh or torus to use
2^D multicasts to effectively broadcast a packet to all other nodes in the mesh or
torus. On the torus, the alternative single broadcast to all the nodes will require
twice as long to complete as the 2^D multicasts on the torus since the distance
between the source node and the furthest destination is double for the single
broadcast.
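
The 2^D classes and the halving of the distance can be sketched as follows. The
numbering of classes by direction is arbitrary here, and the hop counts assume a
torus with the same number of nodes in every dimension. The halving is exact when
that number is odd, as in the 3*3 and 5*5*5 examples. All function names are
hypothetical.

```python
from itertools import product


def quadrant_classes(dims, first_class=1):
    """Assign one class value to each of the 2**dims direction combinations.

    Each value maps to a tuple of +1/-1, one sign per dimension.
    """
    return {first_class + n: signs
            for n, signs in enumerate(product((+1, -1), repeat=dims))}


def max_hops_single_class(side, dims):
    """Furthest destination when only positive links are used."""
    return dims * (side - 1)


def max_hops_all_classes(side, dims):
    """Furthest destination when each dimension may go either way round."""
    return dims * (side // 2)


print(quadrant_classes(2))
print(max_hops_single_class(3, 2), max_hops_all_classes(3, 2))  # 4 2
print(max_hops_single_class(5, 3), max_hops_all_classes(5, 3))  # 12 6
```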

Enhancements and Alternatives to Class Tables

Instead of or in addition to using tables on the switch, the class value and perhaps
other characteristics of the packet can be input to an algorithm. If table entries are
the same for all class values, then it might be better to use an algorithm. If a switch
needs to decide between conflicting actions demanded by different tables, the switch
can be programmed with the relative priorities of the different tables.

Using Class-based Multicasting to Create other Classes

In Example 5, class value 0 is used for the usual unicast, while class value 1 can be
used to broadcast to all nodes in the torus. Having established a broadcast
mechanism, it can be used to broadcast any data. For example, this data could be the
class table entries for other classes. For example, Example 5 identified a need for
the additional classes 2, 3 and 4. Once multicasting on class 1 is established by
whatever means, class 1 can be used to create classes 2, 3 and 4. In general, once
communication on a particular class value or values is established, that
communication can be used to establish communication on other class values.

Example 7. Dense Matrix Calculation using Class Function

The present invention also uses the class function on a torus computer network to do
dense matrix calculations. By using the hardware implemented class function on the
torus computer network it is possible to do high performance dense matrix
calculations.

Class function is the name used in this example for multicasting based on class
network routing. Often, the multicast is to other nodes in the same row. So often it
is sufficient for class routing to implement a single phase of path-based multidrop
message passing, which is described in Example 1. When the multicast is not to a
row, it is to a plane, cube or other higher dimension subset of the torus or mesh. In
this case, optimal performance demands that class routing implement a more
sophisticated multicast, such as the single phase multicast described in Example 5.

The present invention makes dense matrix inversion algorithms on distributed
memory parallel supercomputers with hardware class function capability perform
faster. This is achieved by exploiting the fact that the communication patterns of
dense matrix inversion can be served by hardware class functions. This results in
faster execution times.

The algorithms as discussed herein are well known in the art, and are discussed, for
example, in NUMERICAL RECIPES IN FORTRAN, THE ART OF SCIENTIFIC
COMPUTING, Second Edition, by William H. Press, et al., particularly at page 27
et seq.

Figure 4 illustrates a 4 x 4 grid of processors wherein each processor is labeled by its
row, column numerals. For example the processor in row 2 column 3 is p(2,3). The
column i and row i are also shown (shaded areas) as well as the directions that the
column/row has to be sent via the class function.

One can invert a dense linear matrix using standard algorithms such as Gauss-Jordan
elimination as well as other methods. In general the I/O required is of a special
one-to-many variety that is well suited to the communication functionality of a
parallel supercomputer with hardware class function capability. One can utilize the
class functionality to multicast data to an entire row or surface of the machine.

Some of the terms used in the description of this invention are explained below:

The Gauss-Jordan algorithm:
The kernel of the Gauss-Jordan algorithm without pivoting is given below. Initially
b is an identity matrix and a is the matrix whose inverse is being computed.

do i = 1, N
   do j = 1, N
      do k = 1, N   (k not equal to i)
         b(k,j) = b(k,j) - [a(k,i) / a(i,i)] * b(i,j)
         a(k,j) = a(k,j) - [a(k,i) / a(i,i)] * a(i,j)
      enddo
   enddo
enddo

Equation 1

Distributed memory parallel supercomputer:
Such a computer consists of many nodes. Each node has one or more processors
that operate on local memory. The nodes are typically connected as a d-dimensional
grid and they communicate via the grid links. If the grid is 2-dimensional with PxP
processors then an NxN matrix can be partitioned so that LxL pieces of it reside on
each node (L=N/P). If the machine is not connected as a 2-dimensional grid the
problem can always be mapped onto it by appropriately "folding" the matrix onto
the grid. Without loss of generality and in order to make the presentation of this
invention simple the processor grid will be assumed to be 2-dimensional.
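
One possible reading of the partitioning, given only as an illustration, is a
contiguous block layout in which processor p(r,c) holds the LxL block of rows
(r-1)*L+1 through r*L and columns (c-1)*L+1 through c*L. The text does not say
which layout is used, so the block layout, the requirement that P divide N, and the
function names below are all assumptions.

```python
def block_size(n, p):
    """L = N/P, the side of the block held by each processor."""
    if n % p != 0:
        raise ValueError("this sketch assumes that P divides N")
    return n // p


def owner(row, col, n, p):
    """Processor p(r, c) holding matrix element (row, col); all 1-based."""
    size = block_size(n, p)
    return ((row - 1) // size + 1, (col - 1) // size + 1)


# An 8x8 matrix on the 4 x 4 processor grid of Figure 4 gives 2x2 blocks.
print(block_size(8, 4))   # 2
print(owner(3, 5, 8, 4))  # (2, 3)
```

Under this layout, all elements of a matrix row sit on a single row of processors,
which is why row and column multicasts suffice for the kernel of Equation 1.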

Hardware class functions:
Class functions are a hardware implementation of multicast. Suppose that processor
p(1,1) (here the numerals indicate the position of the processor on the grid, also see
Figure 4) wants to send the same packet of data to processors p(1,2), p(1,3) and
p(1,4). Typically this is done by first sending the data to processor p(1,2). Once the
data arrives into p(1,2) software routines read it and store it in memory. Then p(1,2)
reads the data from memory and sends it to p(1,3) etc. The problem with this is that
it takes a long time to fully receive the packet of data into memory and then resend
it. If the hardware was built so that the packet of data that arrived into p(1,2) was
simultaneously stored into the p(1,2) memory and immediately sent to p(1,3) then
the delay would be greatly reduced. The hardware function of p(1,1) sending a
packet of data to p(1,4) while that packet is deposited into the memory of the
intermediate processors that it goes through is called the hardware class function.

The invention:

This invention exploits the fact that the communication patterns of dense matrix
inversion (for example using the Gauss-Jordan method) can utilize class functions.
This can be seen from Equation 1 that describes the Gauss-Jordan algorithm:

The a(i,i) are communicated via some other method, for example a global broadcast.
Then the right hand side of the equations for b(k,j) and a(k,j) involve elements that
have only one index different from (k,j) but not both (a(k,i), a(i,j) and b(i,j)). Class
function communication can be used to send such elements across the relevant
processors.

For example, in order to calculate b(k,j) for a given row k (1 <= j <= N) one needs
a(k,i) to be known for all processors that contain the row k. Therefore, one must
send a(k,i) along the row of processors that contain the matrix row k. This can be
done using the class functionality. As already discussed this results in large
reductions in total communication time.

This completes the description of the idea for this invention. The idea was described
for the Gauss-Jordan algorithm but it is not specific to it. For example this idea
applies to the "Gauss-Jordan with Pivoting", "Gaussian Elimination with Back
Substitution" and "LU Decomposition" algorithms.

An implementation of this idea (using the Gauss-Jordan algorithm) with all the
details is presented below as an example. In order to make the example easy to
understand the simplest implementation was chosen. More complex
implementations that result in communications involving larger data packets have
also been worked out. Depending on the size of the processor grid and the size of the
matrix larger packet sizes may be desirable since they further improve performance
by minimizing latency. However, this does not affect the premise of this idea.

An example algorithm:

The Gauss-Jordan algorithm is used to find the matrix inverse of a dense matrix of
size NxN uniformly spread out on a grid of PxP nodes. Therefore each node has an
LxL piece of the matrix in its memory (L=N/P). A hardware class function is used to
multicast data across rows and columns. For a visual picture of this algorithm please
refer to Figure 1 above.

For each 1 <= i <= N

1) Using class functions send to the left and right the column i of a's (a(k,i),
1 <= k <= N)

2) Scale the elements a, b of row i by a(i,i)

3) Using class functions send up and down the new row i of a's and b's (a(i,j) and
b(i,j), 1 <= j <= N)

4) Now all processors have the necessary elements to do the standard Gauss-Jordan
step for column i. At the end of this, column i is the same as column i of the identity
matrix.

Repeat
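
The four steps can be followed in a single-process sketch, written in Python for
convenience. It is not the distributed implementation: the two class function
multicasts are modeled simply as copies of column i and of the scaled row i taken
before the update, which is the data every processor would hold after steps 1 and 3.
The function name gauss_jordan_inverse is hypothetical, and like the kernel of
Equation 1 the sketch does no pivoting.

```python
def gauss_jordan_inverse(a):
    """Invert a square matrix (list of lists) by Gauss-Jordan without pivoting."""
    n = len(a)
    a = [[float(v) for v in row] for row in a]  # work on a copy
    b = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    for i in range(n):
        # Step 1: column i of a, as it would be multicast left and right.
        col_i = [a[k][i] for k in range(n)]
        pivot = col_i[i]
        if pivot == 0.0:
            raise ZeroDivisionError("zero pivot; this sketch does no pivoting")
        # Step 2: scale row i of a and b by a(i,i).
        a[i] = [v / pivot for v in a[i]]
        b[i] = [v / pivot for v in b[i]]
        # Step 3: the new row i, as it would be multicast up and down.
        row_a, row_b = a[i][:], b[i][:]
        # Step 4: the Gauss-Jordan step for column i on every other row.
        for k in range(n):
            if k == i:
                continue
            factor = col_i[k]  # a(k,i) as it was before this step
            a[k] = [akj - factor * aij for akj, aij in zip(a[k], row_a)]
            b[k] = [bkj - factor * bij for bkj, bij in zip(b[k], row_b)]
    return b


matrix = [[4.0, 3.0], [6.0, 3.0]]
inverse = gauss_jordan_inverse(matrix)
print(inverse)
```

Multiplying a(k,i) by the scaled row i gives the same term as [a(k,i) / a(i,i)] * a(i,j)
in Equation 1. Taking the copies before the update keeps the factor from being
overwritten while row k is modified.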

End of examples:

While several embodiments and variations of the present invention for class
networking routing are described in detail herein, it should be apparent that the
disclosure and teachings of the present invention will suggest many alternative
designs to those skilled in the art.

CLAIMS

Having thus described our invention, what we claim as new and desire to secure by
Letters Patent is:



1. A method of class network routing in a network to allow a compute
processor in a network of compute processors located at nodes of the network to
multicast a message to a plurality of other compute processors in the network
comprising:
dividing a message into one or more message packets which pass
through the network;
adding a class value to a message packet;
at each switch in the network, using the class value as an index to at
least one table whose stored values, or as an input to an algorithm whose generated
values, determine actions performed by the switch on the message packet.

2. The method of claim 1, including providing a class value determining a
switch action of path-based multidrop message passing for multiphase multicasting
of a message packet through the network, to determine if a local node should deposit
a copy of the message packet at the local node.

3. The method of claim 2, including:
providing a class value to implement multidrop message passing;
providing each switch with a table to determine if a copy of the
message packet is to be deposited at the local node.

4. The method of claim 1, including providing a class value determining a
switch action of multidestination message passing of a message packet to multiple
destination nodes in the network.

5. The method of claim 1, wherein a switch duplicates an incoming packet onto
multiple outgoing links.

6. The method of claim 5, including providing a class routing table with
different values on different incoming links.

7. The method of claim 4, wherein the message packet is multicast to an entire
row or surface of the network.

8. The method of claim 1, including:
performing dense matrix inversion algorithms on a network of
distributed memory parallel computers with hardware class function multicast
capability, wherein the hardware class function multicast capability simultaneously
stores into memory a message packet that arrives and immediately sends the
message packet to one or more other nodes while that message packet is being
stored into memory, such that the communication patterns of the dense matrix
inversion algorithms are served by the hardware class function multicast capability
to minimize communication delays.

9. The method of claim 1, wherein the network comprises a network of
distributed-memory parallel computers;
providing each node of the computer network with one or more
processors that operate on local memory;
coordinating the actions of multiple nodes of the computer by using
class routing to pass messages between the multiple nodes.

10. The method of claim 9, including:
pairing each node with a switch of the network;
connecting the switches to form a three dimensional torus wherein
each switch is linked to six other switches, the links are coupled to a switch in a
positive direction and also to a switch in a negative direction in each of the three
dimensions;
identifying each switch by an x,y,z logical address on the torus,
wherein each node has the address of its switch;
including a field value for the logical address in the packet header, to
enable the packet to identify a destination node.

11. The method of claim 1, including using a D-phase multicast from an [...]
node to all nodes of a D-dimensional cube wherein, in a first phase, [...]
sends a multidrop message to all other nodes in one of the rows of the sending node,
in a second phase each of the recipients of the first phase and the sender of the first
phase simultaneously send a multidrop message to all other nodes in a row
orthogonal to the row of the first phase, in a third phase each of the recipients of the
second phase and the senders of the second phase simultaneously send a multidrop
message to all other nodes in a row orthogonal to the rows of the first and second
phases, and so on in further phases such that all nodes of the cube receive the
broadcast message after all the phases.

12. The method of claim 1, including providing each switch with a table with
associated class values which determine if a copy of a message packet is to be
deposited at a destination node.

13. The method of claim 1, for a D-dimensional network, including providing
2^D class values for multicast in each of the 2^D directions to allow each node in
the network to use 2^D multicasts to effectively broadcast a packet to all other nodes
in the mesh.

14. The method of claim 1, including providing a class value determining a
switch action of a unicast of a message packet through the network to a single
destination node.

15. The method of claim 1, including providing a class value to enable a node to
acquire and store information packets passing through its switch to provide
information on the performance of the network.

16. The method of claim 11, including providing class values to determine if a
copy of the message packet is to go out on an X link or not, and out on a Y link or
not, and out on a Z link or not and so on for the other links of the D dimensions.

17. The method of claim 1, including providing different tables and providing
priorities for different tables to enable a switch to decide between conflicting actions
indicated by different tables.

18. The method of claim 1, including using a class value as an input to an
algorithm for determining a switch action of the switch.

19. The method of claim 1, including using class-based multicasting to create
other classes, such that the contents of a table for a particular class value is
determined by using another class value.